RStudio became Posit on November 2, 2022. (https://posit.co/)
tidyverse package attaches ggplot2,
purrr, tibble, dplyr,
tidyr, stringr, readr,
forcats as well.rmarkdown package to analyze, share and
reproduce.We need both R and RStudio. If you have not installed R and RStudio, please follow the instruction in ModernDive: Chapter 1 Getting Started with Data in R .
R is ‘GNU S’, a freely available language and environment for statistical computing and graphics which provides a wide variety of statistical and graphical techniques: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc.
The R Project for Statistical Computing: https://www.r-project.org
Update R
Currently, R version 4.3.1 (2023-06-16) is running on this system, recommended to update once a year. If you are Windows user, you can use the following code.
if( !("installr" %in% installed.packages()) ){install.packages("installr")}
installr::updateR(TRUE)
System Language
It is more convenient to set the system language of R to be English. You can copy and paste error message for search solutions.
Sys.setenv(LANG = "en")
Posit Site: https://posit.co
Our mission is to create open-source software for data science, scientific research, and technical communication. We do this to enhance the production and consumption of knowledge by everyone, regardless of economic means. (About US: Open source for a better world)
Update RStudio
Top Menu > Help > Check Updates
R packages are extensions to the R statistical programming language. R packages contain code, data, and documentation in a standardised collection format that can be installed by users of R, typically via a centralised software repository such as CRAN (the Comprehensive R Archive Network).
Install packages
We need two packages, rmarkdown and
tidyverse. Install these by the following code. You can
also copy install.packages(c("rmarkdown", "tidyverse")) and
paste in Console. You can also use Tools > Install Packages in the
top menu.
tidyverse: https://www.tidyverse.org, https://cran.r-project.org/package=tidyversermarkdown: https://rmarkdown.rstudio.com, https://cran.r-project.org/package=rmarkdownThe standard link to an R package is in the form:
https://cran.r-project.org/package=<Package Name>.
install.packages(c("rmarkdown", "tidyverse"))
The following code does the same.
c("rmarkdown", "tidyverse") is a vector notation in R to
concatenate, or combine, two values “rmarkdown”, and “tidyverse”.
install.packages("rmarkdown")
install.packages("tidyverse")
When you use a package you need to attach it by the following code.
library(tidyverse)
── Attaching core tidyverse packages ───────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.3 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2 ── Conflicts ─────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the ]8;;http://conflicted.r-lib.org/conflicted package]8;; to force all conflicts to become errors
Tidyverse is a collection of packages designed with consistency. If you are interested in the undelying philosophy, please take a look at Tidyverse design guide. See the messages of attaching packages and conflicts.
Create an account of Posit (RStudio) Cloud: https://posit.cloud/
GET STARTED FREE
We mainly use one of the formats of R Markdwon called R Notebook.
Let us assign the iris data in the pre-installed package
datasets to df_iris. You can give any name
starting from an alphabet, though there are some rules.
df_iris <- datasets::iris
class(df_iris)
[1] "data.frame"
The class of data iris is data.frame, the
basic data class of R. You can assign the same data as a
tibble, the data class of tidyverse as
follows.
tbl_iris <- as_tibble(datasets::iris)
class(tbl_iris)
[1] "tbl_df" "tbl" "data.frame"
df_iris <- iris can replace
df_iris <- datasets::iris because the package
datasets is installed and attached as default. Since you
may have other data called iris included in a differenct
package or you may have changed iris before, it is safer to
specify the name of the package with the name of the data.tf_iris and tbl_iris behave differently. It is
because of the default settings of R Markdown.The View command open up a window to show the contents
of the data and you can use the filter as well.
View(df_iris)
The following simple command also shows the data.
df_iris
The output within R Notebook is a tibble style. Try the same command in Console.
slice(df_iris, 1:10)
11+2
[1] 13
1:10
[1] 1 2 3 4 5 6 7 8 9 10
seq(1,10, by = 2)
[1] 1 3 5 7 9
Let us look at the structure of the data. You can try
str(df_iris) on Console or by adding a code chunk in R
Notebook introducing later.
glimpse(df_iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, …
$ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, …
$ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, …
$ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, …
$ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, se…
There are six types of data in R; Double, Integer, Character, Logical, Raw, Complex.
The names after $ are column names. If you call
df_iris$Species, you have the Species column. Species is in
the 5th collumn, typeof(df_iris[[5]]) does the same as the
next. `df_iris[2,4] = `0.2 is the fourth entry of
Sepal.Width.
typeof(df_iris$Species)
[1] "integer"
class(df_iris$Species)
[1] "factor"
For factors = fct see the
R Document or an explanation in Factor
in R: Categorical Variable & Continuous Variables.
typeof(df_iris$Sepal.Length)
[1] "double"
class(df_iris$Sepal.Length)
[1] "numeric"
Q1. What are the differences ofdf_iris,
slice(df_iris, 1:10) and glimpse(df_iris)
above?
Q2. What are the differences ofdf_iris,
slice(df_iris, 1:10) and glimpse(df_iris) in
the console?
The following is very convenient to get the summary information of a data.
summary(df_iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :50
versicolor:50
virginica :50
Minimum, 1st Quadrant (25%), Median, Mean, 3rd Quadrant (75%), Maximum, and the count of each factor.
We use ggplot to draw graphs. The scatter plot is a
projection of data with two variables \(x\) and \(y\).
ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
geom_point()
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point()
Add title and labels adding labs(). See Modify axis,
legend, and plot labels.
ggplot(data = <data>, aes(x = <column name for x>, y = <column name for y>)) +
geom_point() +
labs(title = "Title", x = "Label for x", y = "Label for y")
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
labs(title = "Scatter Plot of Sepal Data of Iris", x = "Sepal Length", y = "Sepal Width")
Add different colors automatically to each spieces. See Colour related aesthetics: colour, fill, and alpha. Can you see each group?
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point()
The boxplot compactly displays the distribution of a continuous variable. See A box and whiskers plot (in the style of Tukey).
ggplot(data = df_iris, aes(x = Species, y = Sepal.Length)) +
geom_boxplot()
Visualize the distribution of a single continuous variable by dividing the x axis into bins and counting the number of observations in each bin. Histograms (geom_histogram()) display the counts with bars. See Histograms and frequency polygons.
ggplot(data = df_iris, aes(x = Sepal.Length)) +
geom_histogram()
Change the number of bins by bins =
<number>.
ggplot(data = df_iris, aes(x = Sepal.Length)) +
geom_histogram(bins = 10)
Three denity plots of the same data.
ggplot(data = df_iris, aes(x = Species, y = Sepal.Length)) +
geom_violin()
ggplot(data = df_iris, aes(x = Species, y = Sepal.Length)) +
geom_jitter(width = 0.2)
ggplot(data = df_iris, aes(x = Sepal.Length, fill = Species)) +
geom_density(alpha = 0.5)
We will not get into the mathematical models and hypothesis testings. The following is a simple linear regression model and the graph of the fitted line.
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
lm(Sepal.Width ~ Sepal.Length, data = df_iris)
Call:
lm(formula = Sepal.Width ~ Sepal.Length, data = df_iris)
Coefficients:
(Intercept) Sepal.Length
3.41895 -0.06188
The formula of the line in the scattered plot is the following. \[y = (-0.06188)x + 3.41895\] The slope is \(-0.06188\) and the \(y\)-intercept is 3.41895.
It is a little strange and counter intuitive, because it claims that as the sepal width gets smaller as the sepal length gets larger. What is hidden behind?
lm(Sepal.Width ~ Sepal.Length, data = df_iris) |> summary()
Call:
lm(formula = Sepal.Width ~ Sepal.Length, data = df_iris)
Residuals:
Min 1Q Median 3Q Max
-1.1095 -0.2454 -0.0167 0.2763 1.3338
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.41895 0.25356 13.48 <2e-16 ***
Sepal.Length -0.06188 0.04297 -1.44 0.152
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4343 on 148 degrees of freedom
Multiple R-squared: 0.01382, Adjusted R-squared: 0.007159
F-statistic: 2.074 on 1 and 148 DF, p-value: 0.1519
ggplot(data = df_iris, aes(x = Sepal.Length, y = Sepal.Width, color = Species)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
ggplot(df_iris, aes(x = Sepal.Length, y = Sepal.Width)) +
geom_point(aes(color = Species)) +
geom_smooth(aes(color = Species), formula = y ~ x, method = "lm", se = FALSE) +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE, color = "black", linetype = "twodash", size = 1.5)
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
Please use `linewidth` instead.
The column Species is a confounding variable. See Wikipedia: Confounding, Confounding Variables | Definition, Examples & Controls.
Open RStudio first.
Run the following code by clicking the triangle or Run Current Chunk on Top bar. Alternately, you can use New Folder button in the Files tab in the right bottom pane.
dir.create("data")
Choose R Notebook from the pull down File menu in the top bar.
Default* is as follows
---
title: "R Notebook"
output: html_notebook
---
Template
---
title: "Title of R Notebook"
author: "ID and Your Name"
date: "2023-10-15"
output:
html_notebook:
# number_sections: yes
# toc: true
# toc_float: true
---
number_sections: no.toc: true - default is
toc: false.toc_float: true - default is
toc_float: falseInsert Chunk in Code pull down menu in the top bar, or use the C button on top. You can use shortcut keys listed under Tools in the top bar.
library(tidyverse)
There are many sites we can get data. Here we learn how to get two
types of data, i.e, a data with comma separated values in
*.csv, and a Microsoft Excel Data in *.xslx or
*.xsl in three ways. We have used a data in a package of R,
e.g. datasets::iris, and there are many other data packages
of R. However, we focus on how we get so called open public data. The
folowing is the definition of Open Data by World Bank.
The term “Open Data” has a very precise meaning. Data or content is open if anyone is free to use, re-use or redistribute it, subject at most to measures that preserve provenance and openness.
1. The data must be , which means they must be placed in the public domain or under liberal terms of use with minimal restrictions.
2. The data must be , which means they must be published in electronic formats that are machine readable and non-proprietary, so that anyone can access and use the data using common, freely available software tools. Data must also be publicly available and accessible on a public server, without password or firewall restrictions. To make Open Data easier to find, most organizations create and manage Open Data catalogs.
See a list of open public data below.
data folderurl_class <- "<URL of the data>"
download.file(url = url_class, destfile = "data/<file name>")
Copy the link by Right Click or Ctrl+Click and paste it in the URL.
url <- "https://data.un.org/_Docs/SYB/CSV/SYB65_1_202209_Population,%20Surface%20Area%20and%20Density.csv" # long file name
(pop <- read_csv(url))
New names:Rows: 7874 Columns: 7── Column specification ─────────────────────────────────────────────────
Delimiter: ","
chr (7): T02, Population, density and surface area, ...3, ...4, ...5,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
(pop <- read_csv(url)) is a short hand of the
following:
pop <- read_csv(url)
pop
The first row is not the column names, so skip the first row.
url <- "https://data.un.org/_Docs/SYB/CSV/SYB65_1_202209_Population,%20Surface%20Area%20and%20Density.csv"
(pop <- read_csv(url, skip = 1))
New names:Rows: 7873 Columns: 7── Column specification ─────────────────────────────────────────────────
Delimiter: ","
chr (4): ...2, Series, Footnotes, Source
dbl (2): Region/Country/Area, Year
num (1): Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
It is better to save it in the data folder.
url <- "https://data.un.org/_Docs/SYB/CSV/SYB65_1_202209_Population,%20Surface%20Area%20and%20Density.csv"
download.file(url = url, destfile = "data/pop.csv")
trying URL 'https://data.un.org/_Docs/SYB/CSV/SYB65_1_202209_Population,%20Surface%20Area%20and%20Density.csv'
Content type 'application/octet-stream' length 2290838 bytes (2.2 MB)
==================================================
downloaded 2.2 MB
pop <- read_csv("data/pop.csv", skip = 1)
New names:Rows: 7873 Columns: 7── Column specification ─────────────────────────────────────────────────
Delimiter: ","
chr (4): ...2, Series, Footnotes, Source
dbl (2): Region/Country/Area, Year
num (1): Value
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
pop
read_excel, read_xls,
read_xlsxurl_class <- "https://datacatalogfiles.worldbank.org/ddh-published/0037712/DR0090755/CLASS.xlsx"
download.file(url = url_class, destfile = "data/CLASS.xlsx", mode = "wb")
trying URL 'https://datacatalogfiles.worldbank.org/ddh-published/0037712/DR0090755/CLASS.xlsx'
Content type 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' length 94625 bytes (92 KB)
==================================================
downloaded 92 KB
Let us look at the first sheet.
library(readxl)
wb_countries_tmp <- read_excel("data/CLASS.xlsx", sheet = 1, n_max =218)
wb_countries <- wb_countries_tmp |>
select(country = Economy, iso3c = Code, region = Region, income = `Income group`, lending = "Lending category")
wb_countries
readxl: https://readxl.tidyverse.orgreadxl is a part of the tidyverse family but
not automatically attached. So attach it by
library(readxl).read_excel, read_xls,
read_xlsxwb_regions_tmp <- read_excel("data/CLASS.xlsx", sheet = 1, skip = 0, n_max =267) |>
slice(-(1:219))
wb_regions <- wb_regions_tmp |>
select(region = Economy, iso3c = Code) |> drop_na()
wb_regions
data Folderobject <- read_csv("data/<file name>)
Install WDI package for the first time by
install.packages("WDI"). See the
Setup Section Above.
install.packages("WDI")
library(WDI)
Try ?WDI in Console or WDI in the Help tab
on the right buttom pane.
WDI(country = "all",
indicator = "NY.GDP.PCAP.KD",
start = 1960,
end = 2025,
extra = FALSE,
cache = NULL)
Arguments - country: Vector of countries (ISO-2
character codes, e.g. “BR”, “US”, “CA”, or “all”) - indicator: If you
supply a named vector, the indicators will be automatically renamed:
c('women_private_sector' = 'BI.PWK.PRVS.FE.ZS')
WDIsearch(string = "NY.GDP.PCAP.KD",
field = "indicator", cache = NULL)
WDIsearch(string = "NY.GDP.PCAP.KD",
field = "indicator", short = FALSE, cache = NULL)
Default: short = TRUE
WDIsearch(string = "gdp per capita",
field = "name", cache = NULL)
WDIsearch(string = "6.0.GDP_current", field = "indicator", short = FALSE)
Another way to find the indicator
Go to the World Bank Open Data site and select Browse by Indicators and find the data. Then the indicator is included in the URL.
CO2 emissions (metric tons per capita)
Indicator: EN.ATM.CO2E.PC
Check the indicator by WDIsearch.
WDIsearch(string = "EN.ATM.CO2E.PC", field = "indicator", short = FALSE)
Further Examples
WDIsearch(string = "population, total", field = "name", short = FALSE)
WDIsearch(string = "SP.POP.TOTL", field = "indicator", short = FALSE)
Exercise. Try the following on Console or in a new code chunk, and choose one interesting data with its indicator, name and description.
?WDIsearchView(WDIsearch(string = "gdp per capita", field = "name", cache = NULL))View(WDIsearch(string = "gdp per capita", field = "name", short = FALSE, cache = NULL))View(WDIsearch(string = "gdp", field = "name", short = FALSE, cache = NULL))View(WDIsearch(string = "gender", field = "name", short = FALSE, cache = NULL))WDIDescription: GDP per capita is gross domestic product divided by midyear population. GDP is the sum of gross value added by all resident producers in the economy plus any product taxes and minus any subsidies not included in the value of the products. It is calculated without making deductions for depreciation of fabricated assets or for depletion and degradation of natural resources. Data are in constant 2015 U.S. dollars.
gdpcap <- WDI(country = "all", indicator = "NY.GDP.PCAP.KD")
gdpcap
Avoiding downloading the data repeatedly as a CSV file, let us sage
it in the data folder created above.
write_csv(gdpcap, "data/gdpcap.csv")
We can download two data as one data! The GDP per capita data and the
CO2 emission per capita data with indicator
EN.ATM.CO2E.PC.
Description: Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring.
gdpcap_co2 <- WDI(country = "all", indicator = c("NY.GDP.PCAP.KD", "EN.ATM.CO2E.PC"), extra = TRUE)
gdpcap_co2
Let us save this data as well. It is very large, i.e., 14 columns and 16,492 rows, 1.9MB.
write_csv(gdpcap_co2, "data/gdpcap_co2.csv")
dplyr
Overviewdplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:
select() picks variables based on their names.filter() picks cases based on their values.arrange() changes the ordering of the rows.mutate() adds new variables that are functions of
existing variablesgroup_by() takes an existing tbl and converts it into a
grouped tbl.summarise() reduces multiple values down to a single
summary.You can learn more about them in vignette(“dplyr”). As well as these single-table verbs, dplyr also provides a variety of two-table verbs, which you can learn about in vignette(“two-table”).
If you are new to dplyr, the best place to start is the data transformation chapter in R for data science.
select:
Subset columns using their names and types| Helper Function | Use | Example |
|---|---|---|
| - | Columns except | select(babynames, -prop) |
| : | Columns between (inclusive) | select(babynames, year:n) |
| contains() | Columns that contains a string | select(babynames, contains(“n”)) |
| ends_with() | Columns that ends with a string | select(babynames, ends_with(“n”)) |
| matches() | Columns that matches a regex | select(babynames, matches(“n”)) |
| num_range() | Columns with a numerical suffix in the range | Not applicable with babynames |
| one_of() | Columns whose name appear in the given set | select(babynames, one_of(c(“sex”, “gender”))) |
| starts_with() | Columns that starts with a string | select(babynames, starts_with(“n”)) |
filter:
Subset rows using column values| Logical operator | tests | Example |
|---|---|---|
| > | Is x greater than y? | x > y |
| >= | Is x greater than or equal to y? | x >= y |
| < | Is x less than y? | x < y |
| <= | Is x less than or equal to y? | x <= y |
| == | Is x equal to y? | x == y |
| != | Is x not equal to y? | x != y |
| is.na() | Is x an NA? | is.na(x) |
| !is.na() | Is x not an NA? | !is.na(x) |
arrange
and Pipe |>arrange() orders the rows of a data frame by the values
of selected columns.Unlike other dplyr verbs, arrange() largely
ignores grouping; you need to explicitly mention grouping variables
(or use .by_group = TRUE) in order to group by them, and functions of variables are evaluated once per data frame, not once per group. * [pipes`](https://r4ds.had.co.nz/pipes.html) in R for Data
Science.
Examples
arrange(<data>, <varible>)
arrange(<data>, desc(<variable>))
mutateCreate, modify, and delete columns
Useful mutate functions
+, -, log(), etc., for their usual mathematical meanings
lead(), lag()
dense_rank(), min_rank(), percent_rank(), row_number(), cume_dist(), ntile()
cumsum(), cummean(), cummin(), cummax(), cumany(), cumall()
na_if(), coalesce()
if_else(), recode(), case_when()
See https://ggplot2.tidyverse.org/reference/
ggplot(<data>) + geom_point(aes(x = <>, y = <>))
<data> |> ggplot() + geom_point(aes(x = <>, y = <>))
geom_point(): Pointsgeom_boxplot(): A box plotgeom_histogram(): Histogramsgeom_smooth(): Smoothed conditional meansgdpcap <- read_csv("data/gdpcap.csv")
Rows: 16758 Columns: 5── Column specification ─────────────────────────────────────────────────
Delimiter: ","
chr (3): country, iso2c, iso3c
dbl (2): year, NY.GDP.PCAP.KD
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdpcap
summary(gdpcap)
country iso2c iso3c year
Length:16758 Length:16758 Length:16758 Min. :1960
Class :character Class :character Class :character 1st Qu.:1975
Mode :character Mode :character Mode :character Median :1991
Mean :1991
3rd Qu.:2007
Max. :2022
NY.GDP.PCAP.KD
Min. : 144
1st Qu.: 1316
Median : 3574
Mean : 10911
3rd Qu.: 12504
Max. :204191
NA's :3962
gdpcap |> distinct(country, iso2c)
gdpcap |> filter(country == "World") |>
ggplot(aes(x = year, y = NY.GDP.PCAP.KD)) +
geom_line()
gdpcap |> filter(iso2c %in% c("BR", "RU", "IN", "CN")) |>
ggplot(aes(x = year, y = NY.GDP.PCAP.KD, color = country)) +
geom_line() +
labs(title = "GDP per capta of BRICs", y = "GDP per capita (constant 2015 US$)")
wb_countries above.(low_countries <- wb_countries |> filter(income == "Low income"))
(gdpcap_low <- semi_join(gdpcap, low_countries))
Joining with `by = join_by(country, iso3c)`
gdpcap_low |>
ggplot(aes(x = year, y = NY.GDP.PCAP.KD, color = country)) +
geom_line() +
labs(title = "GDP per capta of low income countries", y = "GDP per capita (constant 2015 US$)")
Let us use the saved data.
gdpcap_co2 <- read_csv("data/gdpcap_co2.csv")
Rows: 16758 Columns: 14── Column specification ─────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, NY.GDP.PCAP.KD, EN.ATM.CO2E.PC, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdpcap_co2
co2 <- gdpcap_co2 |>
select(country, iso2c, iso3c, year, co2 = EN.ATM.CO2E.PC, region)
co2 |> filter(country %in% c("France", "Germany", "Italy", "Japan", "United Kingdom", "United States", "Canada", "Russian Federation")) |> distinct(country, iso2c, iso3c, region)
Successfully chosen data of G8 countries.
co2 |> filter(country %in% c("France", "Germany", "Italy", "Japan", "United Kingdom", "United States", "Canada", "Russian Federation")) |>
filter(!is.na(co2)) |>
ggplot(aes(x = year, y = co2, color = country)) + geom_line() +
labs(title = "CO2 Emission Per Capita of G8 Countries")
Let us use the saved data.
gdpcap_co2 <- read_csv("data/gdpcap_co2.csv")
Rows: 16758 Columns: 14── Column specification ─────────────────────────────────────────────────
Delimiter: ","
chr (7): country, iso2c, iso3c, region, capital, income, lending
dbl (5): year, NY.GDP.PCAP.KD, EN.ATM.CO2E.PC, longitude, latitude
lgl (1): status
date (1): lastupdated
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
gdpcap_co2
summary(gdpcap_co2)
country iso2c iso3c year
Length:16758 Length:16758 Length:16758 Min. :1960
Class :character Class :character Class :character 1st Qu.:1975
Mode :character Mode :character Mode :character Median :1991
Mean :1991
3rd Qu.:2007
Max. :2022
status lastupdated NY.GDP.PCAP.KD EN.ATM.CO2E.PC
Mode:logical Min. :2023-09-19 Min. : 144 Min. : 0.000
NA's:16758 1st Qu.:2023-09-19 1st Qu.: 1316 1st Qu.: 0.677
Median :2023-09-19 Median : 3574 Median : 2.396
Mean :2023-09-19 Mean : 10911 Mean : 4.149
3rd Qu.:2023-09-19 3rd Qu.: 12504 3rd Qu.: 6.077
Max. :2023-09-19 Max. :204191 Max. :47.657
NA's :3962 NA's :9350
region capital longitude
Length:16758 Length:16758 Min. :-175.22
Class :character Class :character 1st Qu.: -15.18
Mode :character Mode :character Median : 19.54
Mean : 19.16
3rd Qu.: 50.53
Max. : 179.09
NA's :3528
latitude income lending
Min. :-41.286 Length:16758 Length:16758
1st Qu.: 4.174 Class :character Class :character
Median : 17.277 Mode :character Mode :character
Mean : 18.740
3rd Qu.: 39.715
Max. : 64.184
NA's :3528
Since the most recent year seems to be 2021, let us choose the year.
gdpcap_co2 |> filter(year == 2021) |>
ggplot(aes(x = NY.GDP.PCAP.KD, y = EN.ATM.CO2E.PC)) +
geom_point()
What is wrong? There seems to be many missing values. First use
filter to choose countries, i.e., non-aggregates and see
how many data are in each year.
gdpcap_co2 |> filter(region != "Aggregates", year == "2021")
Alternately, we can find the data points in each year except NA, i.e., not available.
gdpcap_co2 |> filter(region != "Aggregates", !is.na(NY.GDP.PCAP.KD), !is.na(EN.ATM.CO2E.PC)) |>
group_by(year) |> summarize(n = n())
gdpcap_co2 |> filter(region != "Aggregates", !is.na(NY.GDP.PCAP.KD), !is.na(EN.ATM.CO2E.PC), year == 2019) |>
ggplot(aes(x = NY.GDP.PCAP.KD, y = EN.ATM.CO2E.PC)) +
geom_point()
Most of the points are near origin. Let’s try the log-log scale.
gdpcap_co2 |> filter(region != "Aggregates", !is.na(NY.GDP.PCAP.KD), !is.na(EN.ATM.CO2E.PC), year == 2019) |>
ggplot(aes(x = log10(NY.GDP.PCAP.KD), y = log10(EN.ATM.CO2E.PC))) +
geom_point()
gdpco2_2019 <- gdpcap_co2 |> select(region, year, gdp = NY.GDP.PCAP.KD, co2 = EN.ATM.CO2E.PC) |>
filter(region != "Aggregates", !is.na(gdp), !is.na(co2), year == 2019)
gdpco2_2019 |> ggplot(aes(x = log10(gdp), y = log10(co2))) +
geom_point() +
labs(title = bquote(~CO[2]~ "Log10 Scale Plot of "~CO[2]~" Emission Per Capita Against GDP Per Capitain 2019"),
subtitle = bquote(~CO[2]~": metric tons per capita, GDP: constant 2015 US$"),
x = bquote(~CO[2]~ "Emission Per Capita (Log10))"), y = "GDP Per Capita (Log10)") +
geom_smooth(formula = y ~ x, method = "lm", se = FALSE) +
geom_smooth(formula = y ~ x, method = "loess", col = "red")
summary(gdpco2_2019)
region year gdp co2
Length:184 Min. :2019 Min. : 270.1 Min. : 0.03371
Class :character 1st Qu.:2019 1st Qu.: 1961.5 1st Qu.: 0.80489
Mode :character Median :2019 Median : 5771.4 Median : 2.82838
Mean :2019 Mean : 13522.6 Mean : 4.12167
3rd Qu.:2019 3rd Qu.: 16239.2 3rd Qu.: 5.59989
Max. :2019 Max. :106729.0 Max. :31.87720
lm(log10(co2) ~ log10(gdp), data = gdpco2_2019) |> summary()
Call:
lm(formula = log10(co2) ~ log10(gdp), data = gdpco2_2019)
Residuals:
Min 1Q Median 3Q Max
-0.82708 -0.18685 -0.03374 0.20325 0.65721
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.04187 0.13952 -21.80 <2e-16 ***
log10(gdp) 0.88952 0.03663 24.29 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2969 on 182 degrees of freedom
Multiple R-squared: 0.7642, Adjusted R-squared: 0.7629
F-statistic: 589.8 on 1 and 182 DF, p-value: < 2.2e-16
There are \(n = 184\) data. So gdp = \(x_1, x_2, \ldots, x_{184}\) and co2 = \(y_1, y_2, \ldots, y_{184}\). In log10 scale, log10(gdp) = \(\log_{10}(x_1), \log_{10}(x_2), \ldots, \log_{10}(x_{184})\) and log10(co2) = \(\log_{10}(y_1), \log_{10}(y_2), \ldots, \log_{10}(y_{184})\), and call them \(x_1', x_2', \ldots, x_{184}'\) and \(y_1', y_2', \ldots, y_{184}'\). The scatter plot is the points \((x_1', y_1'), (x_2', y_2'), \ldots, (x_{184}', y_{184}')\). Let \(\mathrm{mean}(x')\) and \(\mathrm{mean}(y')\) be mean of \(x_1', x_2', \ldots, x_{184}'\) and \(y_1', y_2', \ldots, y_{184}'\).
\[\mathrm{SSxx} = \sum_{i=1}^{184} (x_i' - \mathrm{mean}(x'))^2, \quad \mathrm{SSxy} = \sum_{i=1}^{184} (x_i' - \mathrm{mean}(x'))(y_i' - \mathrm{mean}(y')).\] Then, by theory, we know the slope and the y-inetrcept as follows. \[\mathrm{Slope} = \frac{\mathrm{SSxy}}{\mathrm{SSxx}}, \quad \mathrm{Intercept} = \mathrm{mean}(y') - \mathrm{Slope}\cdot \mathrm{mean}(x').\]
gdpco2log_ssxx <- sum((log10(gdpco2_2019$gdp)- mean(log10(gdpco2_2019$gdp)))^2)
gdpco2log_ssxy <- sum((log10(gdpco2_2019$gdp)- mean(log10(gdpco2_2019$gdp)))*(log10(gdpco2_2019$co2)- mean(log10(gdpco2_2019$co2))))
gdpco2log_slope <- gdpco2log_ssxy/gdpco2log_ssxx
gdpco2log_intercept <- mean(log10(gdpco2_2019$co2)) - gdpco2log_slope*mean(log10(gdpco2_2019$gdp))
c(Intercept = gdpco2log_intercept, Slope = gdpco2log_slope)
Intercept Slope
-3.0418722 0.8895225
Now consider, the squared difference at \(\log_{10}(x_i)\) between the actual value \(\log_{10}(y_i)\) and the linear estimation \(m\cdot \log_{10}(x_i) + b\). Since it is a square, it is positive. Now take the sum and want to determine \(m\) and \(b\) to minimize the sum, which is called the errors. \[\mathrm{SSE} = \sum_{i=1}^{184} (m\cdot \log_{10}(x_i) + b - \log_{10}(y_i))^2 = \sum_{i=1}^{184} (m\cdot x_i' + b - y_i')^2.\] Let \(k\) be the number of variables for estimation. In our case we estimate using only one variable, we set \(k=1\). The following value is called the Residual Standard Error. \(n-(k+1) = 182\) is called the degree of freedom. \[\mbox{Residual Standard Error} = \sqrt{\frac{\mathrm{SSE}}{n-(k+1)}} = \sqrt{\frac{\mathrm{SSE}}{182}}\]
gdpco2log_model <- lm(log10(co2) ~ log10(gdp), data = gdpco2_2019)
gdpco2log_sse <- sum((log10(gdpco2_2019$co2) - gdpco2log_model$fitted.values)^2)
(gdpco2log_rse <- sqrt(gdpco2log_sse/182))
[1] 0.2969318
The estimated value of \(b\) appears in the row (Intercept) and \(m\) appears in the row log10(gdp). Roughly speaking it is the best fit line against the data points minimizing the sum of the squared distances.
To normalize the scale, we divide the value \(\mathrm{SSyy}\) and consider the following. \[\mathrm{SSyy} = \sum_{i=1}^{184} (\log_{10}(y_i) - \mathrm{mean})^2 = \sum_{i=1}^{184} (y_i'-\mathrm{mean})^2, \quad \mbox{Multiple R Squared} = 1-\frac{\mathrm{SSE}}{\mathrm{SSyy}}\]
gdpco2log_ssyy <- sum((log10(gdpco2_2019$co2)- mean(log10(gdpco2_2019$co2)))^2)
1-gdpco2log_sse/gdpco2log_ssyy
[1] 0.764188
Conclusion: Roughly speaking, the linear model fits much better than the estimation by means and the model explains about 77% of the scattered points.
You can read many excellent books online.
learnr package help you to review and consolidate your
understanding of basis of R.